Conversation
…ance Creates a comprehensive benchmark (playback-pipeline-benchmark) that measures each stage of the playback pipeline independently: - Decode-only performance (frame retrieval latency) - Full pipeline (decode + GPU render + readback) at multiple resolutions - Scrubbing performance (random-access frame rendering) Reports min/avg/p50/p95/p99/max statistics for each stage, effective FPS, and frame budget utilization analysis. Usage: cargo run -p cap-editor --example playback-pipeline-benchmark -- \ --recording-path /path/to/recording [--fps 60] [--frames 300] Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
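The min/avg/p50/p95/p99/max reporting boils down to a nearest-rank percentile over the collected per-stage latencies. The helper below is an illustrative sketch, not the benchmark's actual code:

```rust
/// Nearest-rank percentile over an ascending-sorted slice of latencies (ms).
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty());
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    let mut samples: Vec<f64> = (1..=100).map(|n| n as f64).collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let avg = samples.iter().sum::<f64>() / samples.len() as f64;
    println!(
        "min={} avg={} p50={} p95={} p99={} max={}",
        samples[0],
        avg,
        percentile(&samples, 50.0),
        percentile(&samples, 95.0),
        percentile(&samples, 99.0),
        samples[samples.len() - 1]
    );
}
```

Frame-budget utilization then follows directly: divide each percentile by the per-frame budget (16.67ms at 60fps).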
Key changes to playback.rs: - Increase PARALLEL_DECODE_TASKS from 4 to 6 (sustained) - Add INITIAL_PARALLEL_DECODE_TASKS of 8 during ramp-up phase - Increase PREFETCH_BUFFER_SIZE from 60 to 90 frames - Increase FRAME_CACHE_SIZE from 60 to 90 (matches decoder cache) - Reduce warmup threshold from 20 to 10 frames (faster first frame) - Reduce warmup timeout from 1000ms to 500ms - Reduce frame wait timeouts from 200ms to 100ms for better responsiveness - Reduce aggressive skip threshold from 10 to 6 frames behind - Reduce PREFETCH_BEHIND from 15 to 10 (focus more on forward frames) These changes should improve: - First-frame latency (faster warmup) - Sustained playback FPS (more parallel decoding) - Frame availability (larger cache and prefetch buffer) - Responsiveness (shorter wait timeouts) Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Previously, YUV→RGBA conversion (compute shader) created its own command encoder and called queue.submit() independently, then the main render pass created another encoder and submitted again. This meant 2+ GPU submissions per frame. Changes: - Add convert_nv12_to_encoder() and convert_yuv420p_to_encoder() methods to YuvToRgbaConverter that dispatch compute passes on an external encoder instead of creating their own - Add prepare_with_encoder() to DisplayLayer that uses the batched conversion methods - Add prepare_with_encoder() to RendererLayers that passes the shared encoder through to display layer - Modify produce_frame() to create the command encoder first and pass it to both prepare and render phases Result: Single queue.submit() per frame instead of 2+, reducing GPU overhead and improving frame throughput. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
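The effect of batching can be illustrated with a toy mock (all names here are hypothetical stand-ins, not the wgpu API): recording both passes on one shared encoder costs a single submit, versus one submit per pass.

```rust
/// Toy stand-in for a GPU queue that counts submissions.
struct MockQueue { submits: u32 }
/// Toy stand-in for a command encoder that records named passes.
struct MockEncoder { passes: Vec<&'static str> }

impl MockQueue {
    fn submit(&mut self, encoder: MockEncoder) -> usize {
        self.submits += 1;
        encoder.passes.len()
    }
}

fn main() {
    // Before: each stage creates its own encoder and submits independently.
    let mut q = MockQueue { submits: 0 };
    q.submit(MockEncoder { passes: vec!["yuv_to_rgba"] });
    q.submit(MockEncoder { passes: vec!["render"] });
    assert_eq!(q.submits, 2);

    // After: both passes recorded on one shared encoder, one submit.
    let mut q = MockQueue { submits: 0 };
    let mut enc = MockEncoder { passes: vec![] };
    enc.passes.push("yuv_to_rgba"); // convert_nv12_to_encoder / convert_yuv420p_to_encoder
    enc.passes.push("render");      // main render pass
    let recorded = q.submit(enc);
    assert_eq!(recorded, 2);
    assert_eq!(q.submits, 1);
}
```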
Previously finish_encoder() would: 1. Discard any pending readback result 2. Submit current frame readback 3. Wait synchronously for current readback Now it: 1. Wait for previous frame's readback (should already be done) 2. Submit current frame's readback (starts async) 3. Return the previous frame immediately This means GPU readback of frame N overlaps with CPU processing and decode of frame N+1, reducing the synchronous wait time per frame. The first frame still waits synchronously, but subsequent frames benefit from the pipelining. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
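The pipelining pattern is essentially a one-slot swap: submit the current readback, hand back the previous one. A generic sketch (type and method names hypothetical):

```rust
/// One-deep pipeline: submitting the current item returns the previous one,
/// so the new item stays "in flight" until the next call.
struct Pipelined<T> { pending: Option<T> }

impl<T> Pipelined<T> {
    fn new() -> Self { Self { pending: None } }

    /// Returns the previous frame's (already completed) readback, if any.
    fn submit(&mut self, current: T) -> Option<T> {
        self.pending.replace(current)
    }

    /// Drain the last in-flight item; without this, the final frame is lost.
    fn flush(&mut self) -> Option<T> {
        self.pending.take()
    }
}

fn main() {
    let mut p = Pipelined::new();
    assert_eq!(p.submit("frame0"), None);            // first frame: nothing ready yet
    assert_eq!(p.submit("frame1"), Some("frame0"));  // frame0 returned while frame1 is in flight
    assert_eq!(p.flush(), Some("frame1"));
}
```

The `flush` here corresponds to the later `flush_pipeline()` fix for exports: without it, the last submitted frame would never be collected.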
Audio sync improvements: - Pre-rendered audio: tighten jump detection threshold from 100ms to 50ms (audio now re-syncs when video playhead drifts by more than 50ms) - macOS (non-prerendered): reduce sync threshold from 120ms to 80ms - Windows (non-prerendered): reduce sync threshold from 200ms to 100ms - Windows: reduce hard seek threshold from 500ms to 300ms - Windows: reduce min sync interval callbacks from 50 to 30 for more responsive correction These changes address user reports of audio being behind video or taking a couple of seconds to sync up during playback. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
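All of these resync decisions reduce to a drift comparison against a per-platform threshold. A sketch using the thresholds above (function name hypothetical):

```rust
/// Decide whether audio should re-seek to the video playhead.
fn needs_resync(audio_pos_ms: f64, video_pos_ms: f64, threshold_ms: f64) -> bool {
    (audio_pos_ms - video_pos_ms).abs() > threshold_ms
}

fn main() {
    const PRERENDERED_MS: f64 = 50.0; // was 100.0
    const MACOS_MS: f64 = 80.0;       // was 120.0

    assert!(!needs_resync(1000.0, 1040.0, PRERENDERED_MS)); // 40ms drift: tolerated
    assert!(needs_resync(1000.0, 1060.0, PRERENDERED_MS));  // 60ms drift: re-sync
    assert!(!needs_resync(1000.0, 1060.0, MACOS_MS));       // same drift, looser platform threshold
}
```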
ZoomFocusInterpolator was cloning CursorEvents (Vec of cursor moves and clicks) on every frame during playback. For recordings with many cursor events, this adds unnecessary allocation pressure. Changes: - Add ZoomFocusInterpolator::new_arc() that accepts Arc<CursorEvents> - Playback and preview paths now use Arc sharing instead of deep cloning - Reduces per-frame allocation during playback - Renderer channel capacity reduced from 8 to 4 to reduce stale frame queuing and wasted decode work Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
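The Arc change swaps a per-frame deep clone of the event `Vec` for a refcount bump. Sketch with a stand-in `CursorEvents` type:

```rust
use std::sync::Arc;

// Stand-in for the real CursorEvents (moves/clicks vectors).
struct CursorEvents { moves: Vec<(f64, f64)> }

fn main() {
    let events = Arc::new(CursorEvents { moves: vec![(0.0, 0.0); 10_000] });

    // Per-frame: clone the Arc (atomic refcount increment), not the 10k-element Vec.
    let per_frame = Arc::clone(&events);
    assert_eq!(Arc::strong_count(&events), 2);

    // Both handles point at the same allocation — no copy occurred.
    assert!(std::ptr::eq(Arc::as_ptr(&events), Arc::as_ptr(&per_frame)));
    assert_eq!(per_frame.moves.len(), 10_000);
}
```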
- Reduce decoder frame timeout from 5000ms to 2000ms (normal frames) - Reduce decoder initial seek timeout from 20000ms to 10000ms - Reduce GPU buffer wait timeout from 30s to 10s - Improve GPU readback poll loop: use progressive backoff instead of tight busy-wait (yield for first 10 polls, 100us sleep for 10-100 polls, 1ms sleep after that) - Reduces CPU usage during GPU readback waiting while maintaining low latency for fast readbacks Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
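The progressive backoff can be captured as a pure function of the poll count. This is a sketch of the policy described above, not the actual poll loop:

```rust
use std::time::Duration;

/// Backoff schedule for the GPU readback poll loop:
/// yield for the first 10 polls, sleep 100µs up to poll 100, then 1ms.
fn backoff_for(poll: u32) -> Option<Duration> {
    match poll {
        0..=9 => None, // spin via std::thread::yield_now() — lowest latency
        10..=99 => Some(Duration::from_micros(100)),
        _ => Some(Duration::from_millis(1)),
    }
}

fn main() {
    assert_eq!(backoff_for(0), None);
    assert_eq!(backoff_for(50), Some(Duration::from_micros(100)));
    assert_eq!(backoff_for(200), Some(Duration::from_millis(1)));
}
```

Fast readbacks complete during the yield phase with near-zero added latency, while slow ones stop burning a CPU core.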
The pipelined readback optimization returns the previous frame's readback result while submitting the current frame's readback. This means the very last frame in a render loop would be lost since no subsequent call would collect it. Changes: - Add flush_pending_readback() function to frame_pipeline - Add flush_pipeline() method to FrameRenderer - Call flush_pipeline() at end of render_video_to_channel() to capture the last frame during exports - For playback, losing the last frame is acceptable (playback loop manages its own frame counting) Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Adds periodic (every 2s) tracing::info! logs during playback that report: - Effective FPS (frames rendered / elapsed time) - Total frames rendered and skipped - Cache hit count (frames served from LRU cache) - Prefetch hit count (frames from prefetch buffer) - Sync decode count (frames decoded synchronously on-demand) - Current prefetch buffer utilization This data helps diagnose playback performance issues in real-world usage without requiring the benchmark tool. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
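The effective-FPS and 2-second interval logic can be sketched as a small stats struct (names hypothetical, not the actual playback code):

```rust
use std::time::{Duration, Instant};

/// Rolling playback stats, reported every 2s.
struct PlaybackStats {
    started: Instant,
    last_log: Instant,
    rendered: u64,
    skipped: u64,
    cache_hits: u64,
}

impl PlaybackStats {
    fn effective_fps(&self, now: Instant) -> f64 {
        self.rendered as f64 / (now - self.started).as_secs_f64()
    }

    /// True at most once per 2-second window.
    fn should_log(&mut self, now: Instant) -> bool {
        if now - self.last_log >= Duration::from_secs(2) {
            self.last_log = now;
            true
        } else {
            false
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut s = PlaybackStats { started: t0, last_log: t0, rendered: 120, skipped: 3, cache_hits: 90 };
    let now = t0 + Duration::from_secs(2);
    println!(
        "fps={:.1} rendered={} skipped={} cache_hits={}",
        s.effective_fps(now), s.rendered, s.skipped, s.cache_hits
    );
    assert!(s.should_log(now));
    assert!(!s.should_log(now + Duration::from_millis(100)));
}
```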
Convert RGBA frames to NV12 format before sending over WebSocket, reducing per-frame data from width*height*4 bytes (RGBA) to width*height*1.5 bytes (NV12) — a 62.5% bandwidth reduction. At half-res (1248x702) this reduces per-frame size from ~3.5MB to ~1.3MB, dropping bandwidth needs from ~210MB/s to ~78MB/s at 60fps. Changes: - Add WSFrameFormat enum (Rgba/Nv12) to WSFrame struct - Add WSFrame::from_rendered_frame_nv12() constructor that converts RGBA data to NV12 using the existing convert_to_nv12() function - Add pack_ws_frame() helper that selects correct packing format - Editor window frame callbacks now use NV12 format - Screenshot editor and camera legacy paths remain RGBA - WebSocket stats logging now reports actual format (NV12/RGBA) The frontend already has full NV12 support via WebGPU compute shader (renderNv12FrameWebGPU) and Canvas2D fallback, so no frontend changes are needed. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
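The bandwidth figures follow directly from the per-pixel sizes (4 bytes for RGBA, 1.5 for 4:2:0 NV12):

```rust
/// Bytes per frame for RGBA vs NV12 (4:2:0, Y plane + interleaved UV plane).
fn rgba_bytes(w: u32, h: u32) -> u64 { w as u64 * h as u64 * 4 }
fn nv12_bytes(w: u32, h: u32) -> u64 { w as u64 * h as u64 * 3 / 2 }

fn main() {
    let (w, h) = (1248, 702);
    // 1248*702*4 = 3,504,384 (~3.5MB); 1248*702*3/2 = 1,314,144 (~1.3MB)
    println!("rgba={} nv12={}", rgba_bytes(w, h), nv12_bytes(w, h));

    let saving = 1.0 - nv12_bytes(w, h) as f64 / rgba_bytes(w, h) as f64;
    assert!((saving - 0.625).abs() < 1e-9); // 62.5% smaller per frame
}
```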
Add evict_far_from() method to FrameCache that removes entries far from the current playhead position. Called when aggressive frame skipping occurs (playback falls behind) to prevent the cache from holding stale frames that will never be used again. This keeps the LRU cache focused on frames near the current playhead, improving cache hit rates during normal sequential playback. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
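A minimal sketch of distance-based eviction, assuming a cache keyed by frame index (the real `FrameCache` is an LRU; this stand-in uses a plain map to show only the eviction rule):

```rust
use std::collections::HashMap;

struct FrameCache { frames: HashMap<u32, Vec<u8>> }

impl FrameCache {
    /// Drop entries more than `keep_radius` frames from the playhead.
    fn evict_far_from(&mut self, playhead: u32, keep_radius: u32) {
        self.frames.retain(|&idx, _| idx.abs_diff(playhead) <= keep_radius);
    }
}

fn main() {
    let mut cache = FrameCache {
        frames: (0..200).map(|i| (i, vec![0u8; 4])).collect(),
    };
    // Playback skipped ahead to frame 150: keep only [120, 180].
    cache.evict_far_from(150, 30);
    assert_eq!(cache.frames.len(), 61);
    assert!(cache.frames.contains_key(&120));
    assert!(!cache.frames.contains_key(&119));
}
```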
Previously the camera layer created its own command encoder and called queue.submit() for both YUV→RGBA conversion and texture copy — 2 extra GPU submissions per frame when camera is visible. Changes: - Add copy_from_yuv_output_to_encoder() that copies using external encoder - Add prepare_with_encoder() to CameraLayer that uses batched convert_nv12_to_encoder/convert_yuv420p_to_encoder methods - Update RendererLayers::prepare_with_encoder() to use camera's batched path for both camera and camera_only layers Result: When camera is visible, saves 2-4 queue.submit() calls per frame, reducing GPU overhead. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Rewrite convert_to_nv12() for better performance: - Separate Y and UV computation with row-level bounds checking (one check per row instead of per-pixel) - Use row slices for sequential memory access - Process UV pairs in a tight while loop instead of branching per pixel - Use bit shift (>>1) instead of division for averaging - More cache-friendly: process each row's Y values contiguously, then UV values for even rows This conversion runs on every frame in the frame callback, so reducing its cost directly improves sustained FPS. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
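A conversion in the spirit described — row slices for the Y plane, a tight while loop over UV pairs on even rows, and `>>1` for averaging. This is a sketch, not the actual `convert_to_nv12()`: BT.601 full-range coefficients and even width/height are assumed, and chroma is averaged horizontally only.

```rust
/// Convert tightly packed RGBA to NV12 (Y plane followed by interleaved UV).
fn convert_to_nv12(rgba: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut out = vec![0u8; width * height * 3 / 2];
    let (y_plane, uv_plane) = out.split_at_mut(width * height);

    for row in 0..height {
        // Row slices: one bounds check per row, sequential access.
        let src = &rgba[row * width * 4..(row + 1) * width * 4];
        let dst = &mut y_plane[row * width..(row + 1) * width];
        for x in 0..width {
            let (r, g, b) = (src[x * 4] as i32, src[x * 4 + 1] as i32, src[x * 4 + 2] as i32);
            dst[x] = ((77 * r + 150 * g + 29 * b) >> 8) as u8;
        }
        if row % 2 == 0 {
            // One UV pair per 2 pixels; >>1 instead of /2 for the average.
            let uv_row = &mut uv_plane[(row / 2) * width..(row / 2 + 1) * width];
            let mut x = 0;
            while x < width {
                let (i0, i1) = (x * 4, (x + 1) * 4);
                let r = (src[i0] as i32 + src[i1] as i32) >> 1;
                let g = (src[i0 + 1] as i32 + src[i1 + 1] as i32) >> 1;
                let b = (src[i0 + 2] as i32 + src[i1 + 2] as i32) >> 1;
                uv_row[x] = (((-43 * r - 85 * g + 128 * b) >> 8) + 128) as u8;     // U
                uv_row[x + 1] = (((128 * r - 107 * g - 21 * b) >> 8) + 128) as u8; // V
                x += 2;
            }
        }
    }
    out
}

fn main() {
    // 2x2 all-white image: Y = 255 everywhere, U/V neutral at 128.
    let rgba = [255u8; 16];
    let nv12 = convert_to_nv12(&rgba, 2, 2);
    assert_eq!(&nv12[..4], &[255, 255, 255, 255]);
    assert_eq!(&nv12[4..], &[128, 128]);
}
```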
Previously, the WebSocket handler cloned the entire WSFrame (~1.3MB of NV12 data) on every frame from the watch channel. Now the watch channel stores Arc<WSFrame>, so the clone is just an atomic reference count increment. Changes: - Change create_watch_frame_ws to accept watch::Receiver<Option<Arc<WSFrame>>> - Add pack_ws_frame_ref/pack_nv12_frame_ref/pack_frame_data_ref that take &WSFrame references instead of consuming ownership - Update editor_window.rs to wrap WSFrame in Arc before sending - Update screenshot_editor.rs to wrap WSFrame in Arc before sending - Broadcast-based create_frame_ws (camera_legacy) unchanged as broadcast handles cloning differently This eliminates ~1.3MB of allocation per frame in the WS handler, reducing GC pressure and improving frame delivery throughput. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Add compute shader and Rust pipeline for converting rendered RGBA frames to NV12 format on the GPU before readback. This will reduce readback data size by 62% (from width*height*4 to width*height*1.5 bytes). New components: - shaders/rgba_to_nv12.wgsl: Compute shader that processes 4x2 pixel blocks, writing complete u32 words to avoid data races. Each thread produces 4 Y values (2 rows) and 2 UV pairs. - RgbaToNv12Converter: Creates compute pipeline, manages storage and readback buffers, dispatches conversion compute pass - PendingNv12Readback: Async readback with same progressive backoff poll loop as RGBA readback - Nv12RenderedFrame: Output frame with NV12 data + metadata Not yet wired into the render pipeline - infrastructure only. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
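The race-free trick is that each thread owns whole u32 words of the output buffer. The word packing and the 4x2 thread-grid math can be checked on the CPU; this is a sketch of the arithmetic, not the WGSL itself (the `thread_grid` helper is illustrative and ignores workgroup tiling):

```rust
/// Pack four Y bytes into one little-endian u32, as a 4x2-block thread would
/// write it — each thread writes only words it owns, so no two threads race.
fn pack_y_word(y: [u8; 4]) -> u32 {
    u32::from_le_bytes(y)
}

/// Thread-grid extents for 4x2 pixel blocks (one thread per block).
fn thread_grid(width: u32, height: u32) -> (u32, u32) {
    (width.div_ceil(4), height.div_ceil(2))
}

fn main() {
    assert_eq!(pack_y_word([0x10, 0x20, 0x30, 0x40]), 0x4030_2010);
    assert_eq!(thread_grid(1920, 1080), (480, 540));
    // Non-multiple dimensions round up; edge threads must clamp their reads.
    assert_eq!(thread_grid(1250, 703), (313, 352));
}
```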
…reduces readback by 62% Major optimization: instead of reading back full RGBA from GPU (width*height*4) and converting to NV12 on CPU, the render pipeline now converts RGBA→NV12 on GPU using a compute shader and reads back only NV12 data (width*height*1.5). This eliminates two bottlenecks simultaneously: 1. GPU readback size reduced by 62% (e.g., 3.5MB → 1.3MB at half-res) 2. CPU RGBA→NV12 conversion (~1-2ms per frame) eliminated from render thread Pipeline flow change: Before: GPU render RGBA → readback RGBA → CPU convert NV12 → WS send NV12 After: GPU render RGBA → GPU convert NV12 → readback NV12 → WS send NV12 Implementation: - Add rgba_to_nv12.wgsl compute shader (processes 4x2 pixel blocks, writes complete u32 words to avoid data races) - Add RgbaToNv12Converter with compute pipeline, storage/readback buffers - Add finish_encoder_nv12() for NV12 readback path - Add FrameRenderer::render_nv12() method - Add EditorFrameOutput enum (Rgba/Nv12) for frame callback - Editor renderer now produces NV12 frames directly - Frame callback receives NV12 data without CPU conversion - Export path unchanged (still uses RGBA readback) Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…_of) Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
The NV12 readback path was submitting and immediately waiting for each frame, losing the pipelining benefit of the RGBA path. Now it uses the same pattern: return previous frame's result while submitting current frame's readback. Changes: - Add dual readback buffers to RgbaToNv12Converter (alternating) - Refactor convert_and_readback into submit_conversion + start_readback - Store pending readback in converter (take_pending to retrieve) - finish_encoder_nv12 now returns previous frame while current renders - First frame still waits synchronously, subsequent frames pipelined This ensures the NV12 GPU path gets the same pipelining benefit as the RGBA path — GPU readback of frame N overlaps with rendering of frame N+1. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
The render_nv12 method was missing the retry-on-GPU-error logic that the RGBA render method has. If a GPU buffer mapping fails (which can happen transiently on some hardware), it now retries up to 3 times with progressive backoff — same as the RGBA path. On retry, both the RenderSession and the NV12 converter are reset to clear any stale GPU state. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When the GPU NV12 conversion fails (e.g., dimensions not a multiple of 4), the fallback path was wrapping RGBA data in Nv12RenderedFrame and sending it to the frontend tagged as NV12 format. The frontend WebGPU shader would try to decode NV12, producing incorrect colors. Fix: - Add GpuOutputFormat enum (Nv12/Rgba) to Nv12RenderedFrame - GPU NV12 path sets format=Nv12, fallback sets format=Rgba - Editor frame callback checks the format field: if Nv12, sends directly; if Rgba, does CPU RGBA→NV12 conversion before sending - Ensures the frontend always receives correctly formatted NV12 data Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…ression The GPU RGBA→NV12 compute shader was adding too much GPU-side work at full resolution (1920x1080), causing playback to drop from ~60fps to ~24-30fps on M4 Max. The GPU compute overhead exceeded the readback bandwidth savings. Changes: - Revert editor renderer from render_nv12 back to render (RGBA output) - Restore renderer channel capacity from 4 back to 8 - CPU NV12 conversion still happens in the frame callback for WS bandwidth savings (~1-2ms, well within frame budget) The GPU NV12 infrastructure (shader, pipeline, pipelined readback) is preserved for future use when it can be made resolution-adaptive, but the editor now uses the standard RGBA render path that works well at all resolutions. All other optimizations remain active: - Batched GPU command submissions (YUV→RGBA + render in single encoder) - Pipelined RGBA readback (previous frame returned while current renders) - NV12 over WebSocket (CPU conversion, 62% bandwidth reduction) - Arc<WSFrame> (no deep clone) - Prefetch/decode parallelism (6-8 tasks, 90-frame cache) - Audio sync improvements - Performance instrumentation Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…thub.com/CapSoftware/Cap into cursor/editor-playback-optimization-5f42
@@ -93,32 +36,39 @@ fn pack_nv12_frame(
    let y_stride = width;
NV12 metadata currently hardcodes y_stride = width. Since WSFrame carries stride (and callers fill it for NV12), it seems safer to encode that stride here so the receiver can handle padded NV12 rows.
        uv_stride: u32,
    ) -> Result<&wgpu::TextureView, YuvConversionError> {
        let (effective_width, effective_height, _downscaled) =
            validate_dimensions(width, height, self.gpu_max_texture_size)?;
validate_dimensions can return effective_width/effective_height that differ from the input. This function still uploads/dispatches using the original width/height, which can exceed the allocated texture size when downscaling kicks in. Consider either early-returning when the effective dimensions differ from the input (since we're not resampling here) or consistently using the effective dimensions for uploads/dispatch after ensuring the planes match.
        uv_stride: u32,
    ) -> Result<&wgpu::TextureView, YuvConversionError> {
        let (effective_width, effective_height, _downscaled) =
            validate_dimensions(width, height, self.gpu_max_texture_size)?;
Same thing for convert_yuv420p_to_encoder: effective_width/effective_height is computed but the rest of the function uses the original width/height for uploads and dispatch. If downscaling is possible here, it’d be good to make the behavior explicit (reject vs. handle) to avoid mismatched extents.
        let pending = nv12_converter
            .take_pending()
            .expect("just submitted a conversion");
This expect can take the whole render down if something goes sideways; probably better to surface it as a RenderingError like the other failure paths.
Suggested change:
-            .expect("just submitted a conversion");
+        let pending = nv12_converter
+            .take_pending()
+            .ok_or(RenderingError::BufferMapWaitingFailed)?;
    if let Some(Ok(final_frame)) = frame_renderer.flush_pipeline().await
        && final_frame.width > 0
        && final_frame.height > 0
    {
        sender
            .send((final_frame, frame_number.saturating_sub(1)))
            .await?;
    }
This currently drops any flush error silently. Even if you don’t want to fail the whole render on a flush failure, a warn-level log here would make it much easier to spot pipeline issues.
Suggested change:
-    if let Some(Ok(final_frame)) = frame_renderer.flush_pipeline().await
-        && final_frame.width > 0
-        && final_frame.height > 0
-    {
-        sender
-            .send((final_frame, frame_number.saturating_sub(1)))
-            .await?;
-    }
+    if let Some(result) = frame_renderer.flush_pipeline().await {
+        match result {
+            Ok(final_frame) if final_frame.width > 0 && final_frame.height > 0 => {
+                sender
+                    .send((final_frame, frame_number.saturating_sub(1)))
+                    .await?;
+            }
+            Err(e) => {
+                tracing::warn!(error = %e, "Failed to flush pipelined readback");
+            }
+            _ => {}
+        }
+    }
@@ -93,32 +36,39 @@ fn pack_nv12_frame(
    let y_stride = width;
    let metadata_size = 28;
    let mut output = Vec::with_capacity(data.len() + metadata_size);
-   output.extend_from_slice(&data);
+   output.extend_from_slice(data);
    output.extend_from_slice(&y_stride.to_le_bytes());
    output.extend_from_slice(&height.to_le_bytes());
    output.extend_from_slice(&width.to_le_bytes());
    output.extend_from_slice(&frame_number.to_le_bytes());
    output.extend_from_slice(&target_time_ns.to_le_bytes());
    output.extend_from_slice(&NV12_FORMAT_MAGIC.to_le_bytes());

    output
}
NV12 path currently hardcodes y_stride = width and ignores WSFrame.stride. If the NV12 producer ever has padded rows, the receiver will misinterpret the buffer.
```rust
fn pack_nv12_frame_ref(
    data: &[u8],
    y_stride: u32,
    width: u32,
    height: u32,
    frame_number: u32,
    target_time_ns: u64,
) -> Vec<u8> {
    let metadata_size = 28;
    let mut output = Vec::with_capacity(data.len() + metadata_size);
    output.extend_from_slice(data);
    output.extend_from_slice(&y_stride.to_le_bytes());
    output.extend_from_slice(&height.to_le_bytes());
    output.extend_from_slice(&width.to_le_bytes());
    output.extend_from_slice(&frame_number.to_le_bytes());
    output.extend_from_slice(&target_time_ns.to_le_bytes());
    output.extend_from_slice(&NV12_FORMAT_MAGIC.to_le_bytes());
    output
}
```
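For context, the packer appends a fixed 28-byte little-endian trailer (`y_stride`, `height`, `width`, `frame_number`, `target_time_ns`, magic, in that order). A hypothetical receiver-side helper could recover it by reading from the end of the buffer; `unpack_nv12_metadata` and the `NV12_FORMAT_MAGIC` value below are illustrative stand-ins, not the real implementation:

```rust
// Stand-in magic value for illustration only; the real constant lives
// in the codebase.
const NV12_FORMAT_MAGIC: u32 = 0x4E56_3132;

struct Nv12Metadata {
    y_stride: u32,
    height: u32,
    width: u32,
    frame_number: u32,
    target_time_ns: u64,
}

/// Splits a packed NV12 buffer into (pixel data, metadata), mirroring
/// the field order written by the packer. Returns None if the buffer
/// is too short or the trailing magic does not match.
fn unpack_nv12_metadata(buf: &[u8]) -> Option<(&[u8], Nv12Metadata)> {
    const TRAILER: usize = 28;
    if buf.len() < TRAILER {
        return None;
    }
    let (data, t) = buf.split_at(buf.len() - TRAILER);
    let u32_at = |o: usize| u32::from_le_bytes(t[o..o + 4].try_into().unwrap());
    if u32_at(24) != NV12_FORMAT_MAGIC {
        return None;
    }
    Some((
        data,
        Nv12Metadata {
            y_stride: u32_at(0),
            height: u32_at(4),
            width: u32_at(8),
            frame_number: u32_at(12),
            target_time_ns: u64::from_le_bytes(t[16..24].try_into().unwrap()),
        },
    ))
}
```

A receiver that validates the magic before trusting the trailer also degrades gracefully if a non-NV12 payload arrives on the same channel.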
```rust
WSFrameFormat::Nv12 => pack_nv12_frame_ref(
    &frame.data,
    frame.width,
    frame.height,
    frame.frame_number,
    frame.target_time_ns,
),
```
If you take the y_stride param in pack_nv12_frame_ref, this call site should pass through the stride from the frame.
Suggested change:

```diff
 WSFrameFormat::Nv12 => pack_nv12_frame_ref(
     &frame.data,
+    frame.stride,
     frame.width,
     frame.height,
     frame.frame_number,
     frame.target_time_ns,
 ),
```
```rust
self.ensure_texture_size(device, effective_width, effective_height);
self.swap_output_buffer();

upload_plane_with_stride(queue, &self.y_texture, y_data, width, height, y_stride, "Y")?;
```
`validate_dimensions` can return `effective_width`/`effective_height` smaller than the inputs, but this function still uploads and dispatches using the original `width`/`height`. If downscaling kicks in, that can write past the allocated textures. The fix is to consistently use the effective dimensions for upload extents and dispatch (or explicitly bail when downscaled).
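A minimal sketch of the two options raised above, assuming a free function with illustrative names (`upload_extent`, `allow_downscale` are not from the codebase):

```rust
/// Decide the extent to use for texture uploads/dispatch when the
/// validated (effective) dimensions may be smaller than the requested
/// ones. Either clamp to the effective size, or refuse to proceed.
fn upload_extent(
    requested: (u32, u32),
    effective: (u32, u32),
    allow_downscale: bool,
) -> Option<(u32, u32)> {
    if requested == effective {
        return Some(requested);
    }
    if allow_downscale {
        // Clamp so writes never exceed the textures allocated at the
        // effective size.
        Some((requested.0.min(effective.0), requested.1.min(effective.1)))
    } else {
        // Explicitly bail instead of uploading with mismatched extents.
        None
    }
}
```

Either branch is safe; silently mixing requested and effective sizes is the only wrong choice.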
```rust
let previous_frame = if let Some(prev) = session.pipelined_readback.take_pending() {
    Some(prev.wait(device).await?)
} else {
    None
};
```
Small behavior change: previously a failed previous readback was best-effort (error dropped) and rendering continued; now it hard-errors the whole pipeline via ?. If map failures can be transient on some GPUs, keeping this non-fatal might be safer.
Suggested change:

```diff
 let previous_frame = if let Some(prev) = session.pipelined_readback.take_pending() {
-    Some(prev.wait(device).await?)
+    prev.wait(device).await.ok()
 } else {
     None
 };
```
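A middle ground between hard-erroring and silently dropping the failure is to keep it non-fatal but surface it. A hypothetical helper (names illustrative; `eprintln!` stands in for `tracing::warn!`):

```rust
use std::fmt::Display;

/// Convert a fallible result into an Option, logging the error instead
/// of propagating it, so transient GPU map failures skip one frame
/// rather than aborting the whole pipeline.
fn non_fatal<T, E: Display>(result: Result<T, E>, context: &str) -> Option<T> {
    match result {
        Ok(v) => Some(v),
        Err(e) => {
            // Stand-in for tracing::warn!(error = %e, ...).
            eprintln!("warning: {context}: {e}");
            None
        }
    }
}
```

At the call site this would look like `non_fatal(prev.wait(device).await, "previous pipelined readback failed")`, preserving the old best-effort behavior while making failures visible.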
Optimise playback performance by implementing a comprehensive benchmarking infrastructure, tuning decoder prefetch, batching GPU command submissions, pipelining GPU readbacks, and improving audio synchronization.
This PR addresses key bottlenecks such as the double GPU round-trip for frames, conservative prefetching, and synchronous GPU readbacks, aiming for smoother 60fps playback, faster scrubbing, and tighter audio-video sync across macOS and Windows.
Greptile Overview
Greptile Summary
Comprehensive performance optimization focused on eliminating GPU bottlenecks and improving playback smoothness through pipelined operations and smarter prefetching.
Key optimizations:
Architecture improvements:
- `prepare_with_encoder` variants across display and camera layers enable command buffering

Confidence Score: 4/5
- `crates/editor/src/playback.rs` timeout tuning and `crates/rendering/src/frame_pipeline.rs` GPU buffer timeout reduction

Important Files Changed
- `prepare_with_encoder` variant to batch YUV conversion with other GPU work instead of submitting separately

Flowchart
```mermaid
flowchart TD
    A[Playback Start] --> B[Prefetch Task Spawned]
    A --> C[Main Playback Loop]
    B --> D{Frames Decoded < 15?}
    D -->|Yes| E[Use 8 Parallel Tasks]
    D -->|No| F[Use 6 Parallel Tasks]
    E --> G[Decode Frames Ahead]
    F --> G
    G --> H[Send to Prefetch Buffer]
    H --> I[LRU Frame Cache]
    C --> J{Frame in Cache?}
    J -->|Yes| K[Cache Hit - Use Cached]
    J -->|No| L{Frame in Prefetch Buffer?}
    L -->|Yes| M[Prefetch Hit - Use Frame]
    L -->|No| N{Frame In-Flight?}
    N -->|Yes| O[Wait 100ms for Frame]
    N -->|No| P[Sync Decode with Timeout]
    K --> Q[Render Pipeline]
    M --> Q
    O --> Q
    P --> Q
    Q --> R[Prepare Layers with Encoder]
    R --> S[YUV to RGBA Conversion]
    S --> T[Batch GPU Commands]
    T --> U{Output Format?}
    U -->|RGBA| V[Pipelined RGBA Readback]
    U -->|NV12| W[RGBA to NV12 Conversion]
    W --> X[Pipelined NV12 Readback]
    V --> Y[Triple Buffer Rotation]
    X --> Y
    Y --> Z[Return Previous Frame]
    Z --> AA[WebSocket Send]
    AA --> AB{Frames Behind?}
    AB -->|Yes > 6| AC[Skip Frames]
    AB -->|No| C
    AC --> AD[Evict Cache]
    AD --> C
```

Last reviewed commit: f60bbbe
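The ramp-up branch at the top of the flowchart (8 parallel decode tasks during warm-up, 6 sustained) can be sketched as a simple policy function; the constant names follow the commit message, and the threshold of 15 decoded frames is taken from the flowchart:

```rust
// Constants from the playback tuning described in this PR.
const INITIAL_PARALLEL_DECODE_TASKS: usize = 8;
const PARALLEL_DECODE_TASKS: usize = 6;

/// During the ramp-up phase (fewer than 15 frames decoded, per the
/// flowchart), use extra parallel decode tasks to fill the prefetch
/// buffer quickly, then settle to the sustained count.
fn decode_task_count(frames_decoded: usize) -> usize {
    if frames_decoded < 15 {
        INITIAL_PARALLEL_DECODE_TASKS
    } else {
        PARALLEL_DECODE_TASKS
    }
}
```

The extra warm-up parallelism targets first-frame latency, while the lower sustained count keeps steady-state CPU usage bounded.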